{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploring Image Uniqueness with FiftyOne\n",
    "\n",
    "During model training, the best results will be seen when training on *unique data samples*. For example, finding and removing similar samples in your dataset can avoid accidental concept imbalance that can bias the learning of your model. Or, if duplicate or near-duplicate data is present in both training and validation/test splits, evaluation results may not be reliable. Just to name a few.\n",
    "\n",
    "This tutorial shows how FiftyOne can automatically find and remove near-duplicate images in your datasets and recommend the most unique samples in your data, enabling you to start your model training off right with a high-quality bootstrapped training set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "In this walkthrough, we explore how FiftyOne's image uniqueness tool can be\n",
    "used to analyze and extract insights from raw (unlabeled) datasets.\n",
    "\n",
    "We'll cover the following concepts:\n",
    "\n",
    "-   Loading a dataset from the FiftyOne Dataset Zoo\n",
    "-   Applying FiftyOne's sample uniqueness algorithm to your dataset\n",
    "-   Launching the FiftyOne App and visualizing/exploring your data\n",
    "-   Identifying duplicate and near-duplicate images in your dataset\n",
    "-   Identifying the most unique/representative images in your dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "This tutorial requires either [Torchvision Datasets](https://pytorch.org/docs/stable/torchvision/datasets.html) or [TensorFlow Datasets](https://www.tensorflow.org/datasets) to download the CIFAR-10 dataset used below.\n",
    "\n",
    "You can, for example, install PyTorch as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Modify as necessary (e.g., GPU install). See https://pytorch.org for options\n",
    "!pip install torch\n",
    "!pip install torchvision"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1: Finding duplicate and near-duplicate images\n",
    "\n",
    "A common problem in dataset creation is duplicated data. Although this could be\n",
    "found using file hashing---as in the `image_deduplication` walkthrough---it is\n",
    "less possible when small manipulations have occurred in the data. Even more\n",
    "critical for workflows involving model training is the need to get as much\n",
    "power out of each data samples as possible; near-duplicates, which are samples\n",
    "that are exceptionally similar to one another, are intrinsically less valuable\n",
    "for the training scenario. Let's see if we can find such duplicates and\n",
    "near-duplicates in a common dataset: CIFAR-10."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load the dataset\n",
    "\n",
    "Open a Python shell to begin. We will use the CIFAR-10 dataset, which is\n",
    "available in the FiftyOne Dataset Zoo."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Split 'test' already downloaded\n",
      "Loading 'cifar10' split 'test'\n",
      " 100% |█████████████████████████| 10000/10000 [2.3s elapsed, 0s remaining, 4.1K samples/s]      \n"
     ]
    }
   ],
   "source": [
    "import fiftyone as fo\n",
    "import fiftyone.zoo as foz\n",
    "\n",
    "# Load the CIFAR-10 test split\n",
    "# Downloads the dataset from the web if necessary\n",
    "dataset = foz.load_zoo_dataset(\"cifar10\", split=\"test\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute uniqueness\n",
    "\n",
    "Now we can process the entire dataset for uniqueness. This is a fairly\n",
    "expensive operation, but should finish in a few minutes at most. We are\n",
    "processing through all samples in the dataset, then building a representation\n",
    "that relates the samples to each other. Finally, we analyze this representation\n",
    "to output uniqueness."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded default deployment config for model 'simple_resnet_cifar10'\n",
      "Applied 0 setting(s) from default deployment config\n",
      "Computing uniqueness for 10000 samples...\n",
      " 100% |█████████████████████████| 10000/10000 [48.5s elapsed, 0s remaining, 195.5 samples/s]      \n",
      "Analyzing samples...\n",
      "Uniqueness computation complete\n"
     ]
    }
   ],
   "source": [
    "import fiftyone.brain as fob\n",
    "\n",
    "fob.compute_uniqueness(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above method populates a `uniqueness` field on each sample that contains\n",
    "the sample's uniqueness score. Let's confirm this by printing some information\n",
    "about the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Name:           cifar10-test\n",
      "Persistent:     False\n",
      "Num samples:    10000\n",
      "Tags:           ['test']\n",
      "Sample fields:\n",
      "    filepath:     fiftyone.core.fields.StringField\n",
      "    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n",
      "    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)\n",
      "    uniqueness:   fiftyone.core.fields.FloatField\n"
     ]
    }
   ],
   "source": [
    "# Now the samples have a \"uniqueness\" field on them\n",
    "print(dataset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<Sample: {\n",
      "    'dataset_name': 'cifar10-test',\n",
      "    'id': '5ef389fea29de7cfed2d2d59',\n",
      "    'filepath': '/home/voxel51/fiftyone/cifar10/test/data/00001.jpg',\n",
      "    'tags': BaseList(['test']),\n",
      "    'ground_truth': <Classification: {'label': 'horse'}>,\n",
      "    'uniqueness': 0.5608251500841676,\n",
      "}>\n"
     ]
    }
   ],
   "source": [
    "print(dataset.first())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualize to find duplicate and near-duplicate images\n",
    "\n",
    "Now, let's visually inspect the least unique images in the dataset to see if\n",
    "our dataset has any issues:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "App launched\n"
     ]
    }
   ],
   "source": [
    "# Sort in increasing order of uniqueness (least unique first)\n",
    "dups_view = dataset.sort_by(\"uniqueness\")\n",
    "\n",
    "# Launch the App\n",
    "session = fo.launch_app(view=dups_view)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![dataset](images/uniqueness_1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You will easily see some near-duplicates in the App. It surprised us that\n",
    "there are duplicates in CIFAR-10, too!\n",
    "\n",
    "Of course, in this scenario, near duplicates are identified from visual\n",
    "inspection. So, how do we get the information out of FiftyOne and back into\n",
    "your working environment. Easy! The `session` variable provides a bidirectional\n",
    "bridge between the App and your Python environment. In this case, we will\n",
    "use the `session.selected` bridge. So, in the App, click on some of the\n",
    "duplicates and near-duplicates. Then, execute the following code in the Python\n",
    "shell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get currently selected images from App\n",
    "dup_ids = session.selected\n",
    "\n",
    "# Mark as duplicates\n",
    "dups_view = dataset.select(dup_ids)\n",
    "for sample in dups_view:\n",
    "    sample.tags.append(\"dup\")\n",
    "    sample.save()\n",
    "\n",
    "# Visualize duplicates-only in App\n",
    "session.view = dups_view"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![dups-view](images/uniqueness_2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And the App will only show these samples now. We can, of course access\n",
    "the filepaths and other information about these samples programmatically so you\n",
    "can act on the findings. But, let's do that at the end of Part 2 below!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 2: Bootstrapping a dataset of unique samples\n",
    "\n",
    "When building a dataset, it is important to create a diverse dataset with\n",
    "unique and representative samples. Here, we explore FiftyOne's ability to help\n",
    "identify the most unique samples in a raw dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download some images\n",
    "\n",
    "This walkthrough will process a directory of images and compute their\n",
    "uniqueness. The first thing we need to do is get some images. Let's get some\n",
    "images from Flickr, to keep this interesting!\n",
    "\n",
    "You need a Flickr API key to do this. If you already have a Flickr API key,\n",
    "then skip the next steps.\n",
    "\n",
    "1. Go to <https://www.flickr.com/services/apps/create/>\n",
    "2. Click on Request API Key.\n",
    "   (<https://www.flickr.com/services/apps/create/apply/>) You will need to\n",
    "   login (create account if needed, free).\n",
    "3. Click on \"Non-Commercial API Key\" (this is just for a test usage) and fill\n",
    "   in the information on the next page. You do not need to be very descriptive;\n",
    "   your API will automatically appear on the following page.\n",
    "4. Install the Flickr API:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install flickrapi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You will also need to enable ETA's storage support to run this script, if you\n",
    "haven't yet:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --index https://pypi.voxel51.com voxel51-eta[storage]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's download three sets of images to process together. I suggest using\n",
    "three distinct object-nouns like \"badger\", \"wolverine\", and \"kitten\". For the\n",
    "actual downloading, we will use the provided `query_flickr.py` script:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calling {'method': 'flickr.photos.search', 'format': 'etree', 'nojsoncallback': 1}\n",
      "REST Parser: using lxml.etree\n",
      "Calling {'method': 'flickr.photos.search', 'format': 'etree', 'nojsoncallback': 1}\n",
      "REST Parser: using lxml.etree\n",
      "Downloading 50 images matching query 'badger' to 'data/badger'\n",
      "Calling {'method': 'flickr.photos.search', 'format': 'etree', 'nojsoncallback': 1}\n",
      "REST Parser: using lxml.etree\n",
      "Calling {'method': 'flickr.photos.search', 'format': 'etree', 'nojsoncallback': 1}\n",
      "REST Parser: using lxml.etree\n",
      "Downloading 50 images matching query 'wolverine' to 'data/wolverine'\n",
      "Calling {'method': 'flickr.photos.search', 'format': 'etree', 'nojsoncallback': 1}\n",
      "REST Parser: using lxml.etree\n",
      "Calling {'method': 'flickr.photos.search', 'format': 'etree', 'nojsoncallback': 1}\n",
      "REST Parser: using lxml.etree\n",
      "Downloading 50 images matching query 'kitten' to 'data/kitten'\n"
     ]
    }
   ],
   "source": [
    "from query_flickr import query_flickr\n",
    "\n",
    "# Your credentials here\n",
    "KEY = \"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\"\n",
    "SECRET = \"YYYYYYYYYYYYYYYY\"\n",
    "\n",
    "query_flickr(KEY, SECRET, \"badger\")\n",
    "query_flickr(KEY, SECRET, \"wolverine\")\n",
    "query_flickr(KEY, SECRET, \"kitten\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The rest of this walkthrough assumes you've downloaded some images to your\n",
    "local `.data/` directory."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load the data into FiftyOne\n",
    "\n",
    "In a Python shell, let's now work through getting this data into FiftyOne and\n",
    "working with it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 100% |█████████████████████████████| 160/160 [45.2ms elapsed, 0s remaining, 3.5K samples/s]    \n",
      "Name:           flickr-images\n",
      "Persistent:     False\n",
      "Num samples:    160\n",
      "Tags:           []\n",
      "Sample fields:\n",
      "    filepath: fiftyone.core.fields.StringField\n",
      "    tags:     fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n",
      "<Sample: {\n",
      "    'dataset_name': 'flickr-images',\n",
      "    'id': '5ef38abca29de7cfed2d546a',\n",
      "    'filepath': '/home/voxel51/fiftyone/docs/source/tutorials/data/badger/14271824861_122dfd2788_c.jpg',\n",
      "    'tags': BaseList([]),\n",
      "}>\n"
     ]
    }
   ],
   "source": [
    "import fiftyone as fo\n",
    "\n",
    "dataset = fo.Dataset.from_images_dir(\n",
    "    \"data\", recursive=True, name=\"flickr-images\"\n",
    ")\n",
    "\n",
    "print(dataset)\n",
    "print(dataset.first())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above command uses factory method on the `Dataset` class to traverse a\n",
    "directory of images (including subdirectories) and generate a dataset instance\n",
    "in FiftyOne containing those images.\n",
    "\n",
    "Note that the images are not loaded from disk, so this operation is fast. The\n",
    "first argument is the path to the directory of images on disk, and the third is\n",
    "a name for the dataset.\n",
    "\n",
    "With the dataset loaded into FiftyOne, we can easily launch a App and\n",
    "visualize it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "App launched\n"
     ]
    }
   ],
   "source": [
    "session = fo.launch_app(dataset=dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![dataset2](images/uniqueness_3.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Please refer to the `User Guides` for more\n",
    "useful things you can do with the dataset and App."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute uniqueness and analyze\n",
    "\n",
    "Now, let's analyze the data. For example, we may want to understand what are\n",
    "the most unique images among the data as they may inform or harm model\n",
    "training; we may want to discover duplicates or redundant samples.\n",
    "\n",
    "Continuing in the same Python shell, let's compute and visualize uniqueness."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded default deployment config for model 'simple_resnet_cifar10'\n",
      "Applied 0 setting(s) from default deployment config\n",
      "Computing uniqueness for 160 samples...\n",
      " 100% |█████████████████████████████| 160/160 [1.0s elapsed, 0s remaining, 156.7 samples/s]         \n",
      "Analyzing samples...\n",
      "Uniqueness computation complete\n",
      "Name:           flickr-images\n",
      "Persistent:     False\n",
      "Num samples:    160\n",
      "Tags:           []\n",
      "Sample fields:\n",
      "    filepath:   fiftyone.core.fields.StringField\n",
      "    tags:       fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata:   fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n",
      "    uniqueness: fiftyone.core.fields.FloatField\n"
     ]
    }
   ],
   "source": [
    "import fiftyone.brain as fob\n",
    "\n",
    "fob.compute_uniqueness(dataset)\n",
    "\n",
    "# Now the samples have a \"uniqueness\" field on them\n",
    "print(dataset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<Sample: {\n",
      "    'dataset_name': 'flickr-images',\n",
      "    'id': '5ef38abca29de7cfed2d546a',\n",
      "    'filepath': '/home/voxel51/fiftyone/docs/source/tutorials/data/badger/14271824861_122dfd2788_c.jpg',\n",
      "    'tags': BaseList([]),\n",
      "    'uniqueness': 0.291738740499072,\n",
      "}>\n"
     ]
    }
   ],
   "source": [
    "print(dataset.first())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sort by uniqueness (most unique first)\n",
    "rank_view = dataset.sort_by(\"uniqueness\", reverse=True)\n",
    "\n",
    "# Visualize in the App\n",
    "session.view = rank_view"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![rank-view](images/uniqueness_4.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, just visualizing the samples is interesting, but we want more. We want to\n",
    "get the most unique samples from our dataset so that we can use them in our\n",
    "work. Let's do just that. In the same Python session, execute the following\n",
    "code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<Sample: {\n",
      "    'dataset_name': 'flickr-images',\n",
      "    'id': '5ef38abca29de7cfed2d54dc',\n",
      "    'filepath': '/home/voxel51/fiftyone/docs/source/tutorials/data/wolverine/2428280852_6c77fe2877_c.jpg',\n",
      "    'tags': BaseList([]),\n",
      "    'uniqueness': 1.0,\n",
      "}>\n",
      "2428280852_6c77fe2877_c.jpg\n",
      "49733688496_b6fc5cde41_c.jpg\n",
      "2843545851_6e1dc16dfc_c.jpg\n",
      "7466201514_0a3c7d615a_c.jpg\n",
      "6176873587_d0744926cb_c.jpg\n",
      "33891021626_4cfe3bf1d2_c.jpg\n",
      "8303699893_a7c14c04d3_c.jpg\n",
      "388994554_34d60d1b18_c.jpg\n",
      "15507757203_9613870ffb_c.jpg\n",
      "49485461331_9b8c8ba983_c.jpg\n"
     ]
    }
   ],
   "source": [
    "# Verify that the most unique sample has the maximal uniqueness of 1.0\n",
    "print(rank_view.first())\n",
    "\n",
    "# Extract paths to 10 most unique samples\n",
    "ten_best = [x.filepath for x in rank_view.limit(10)]\n",
    "\n",
    "for filepath in ten_best:\n",
    "    print(filepath.split('/')[-1])\n",
    "\n",
    "# Then you can do what you want with these.\n",
    "# Output to csv or json, send images to your annotation team, seek additional\n",
    "# similar data, etc."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  },
  "nbsphinx": {
   "execute": "never"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
